Efficient Unsupervised Recursive Word Segmentation Using Minimum Description Length
Abstract
Automatic word segmentation is a basic requirement for unsupervised learning in morphological analysis. In this paper, we formulate a novel recursive method for minimum description length (MDL) word segmentation, whose basic operation is resegmenting the corpus on a prefix (equivalently, a suffix). We derive a local expression for the change in description length under resegmentation, i.e., one which depends only on properties of the specific prefix (not on the rest of the corpus). Such a formulation permits use of a new and efficient algorithm for greedy morphological segmentation of the corpus in a recursive manner. In particular, our method does not restrict words to be segmented only once, into a stem+affix form, as do many extant techniques. Early results for English and Turkish corpora are promising.
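The resegmentation operation described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give their local expression for the change in description length, so the sketch simply recomputes a simplified two-part code length after each trial split. The flat per-character lexicon cost `BITS_PER_CHAR`, the function names, and the representation of the corpus as a flat list of morphs are all assumptions made for illustration.

```python
import math
from collections import Counter

BITS_PER_CHAR = 5.0  # assumed flat cost per lexicon character (hypothetical)


def description_length(tokens):
    """Simplified two-part code length in bits: lexicon cost plus corpus cost."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Lexicon: spell out each distinct morph once, plus one end marker.
    lexicon_bits = sum((len(m) + 1) * BITS_PER_CHAR for m in counts)
    # Corpus: encode each token with its empirical-entropy code length.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits


def resegment_on_prefix(tokens, prefix):
    """Split every token that properly begins with `prefix` into two morphs."""
    out = []
    for t in tokens:
        if len(t) > len(prefix) and t.startswith(prefix):
            out.extend([prefix, t[len(prefix):]])
        else:
            out.append(t)
    return out


def candidate_prefixes(tokens):
    """All proper, non-empty prefixes of the morphs currently in the corpus."""
    return {t[:i] for t in set(tokens) for i in range(1, len(t))}


def greedy_segment(tokens):
    """Repeatedly apply the single prefix split that shortens the code most.

    Morphs produced by one split can be split again on a later pass, so a
    word is not limited to a single stem+affix cut.
    """
    current = list(tokens)
    best_dl = description_length(current)
    while True:
        best = None
        for p in sorted(candidate_prefixes(current)):
            trial = resegment_on_prefix(current, p)
            dl = description_length(trial)  # global recomputation, not local
            if dl < best_dl - 1e-9:
                best, best_dl = trial, dl
        if best is None:
            return current
        current = best


corpus = ["walked", "walks", "walking", "talked", "talks", "talking"]
segmented = greedy_segment(corpus)
```

Recomputing the full description length for every candidate makes each pass expensive, which is the cost a local expression depending only on the prefix is meant to avoid.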
Related resources
Fully Unsupervised Word Segmentation with BVE and MDL
Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to gene...
Unsupervised Segmentation of Poisson Data
This paper describes a new approach to the analysis of Poisson point processes, in time (1D) or space (2D), which is based on the minimum description length (MDL) framework. Specifically, we describe a fully unsupervised recursive segmentation algorithm for 1D and 2D observations. Experiments illustrate the good performance of the proposed methods.
Unsupervised Word Induction Using MDL Criterion
Unsupervised learning of units (phonemes, words, phrases, etc.) is important to the design of statistical speech and NLP systems. This paper presents a general source-coding framework for inducing words from natural language text without word boundaries. An efficient search algorithm is developed to optimize the minimum description length (MDL) induction criterion. Despite some seemingly over-s...
Unsupervised SAR Image Segmentation using Recursive Partitioning
We present a new approach to SAR image segmentation based on a Poisson approximation to the SAR amplitude image. It has been established that SAR amplitude images are well approximated using Rayleigh distributions. We show that, with suitable modifications, we can model piecewise homogeneous regions (such as tanks, roads, scrub, etc.) within the SAR amplitude image using a Poisson model that bears a known...
Bootstrap Voting Experts
BOOTSTRAP VOTING EXPERTS (BVE) is an extension to the VOTING EXPERTS algorithm for unsupervised chunking of sequences. BVE generates a series of segmentations, each of which incorporates knowledge gained from the previous segmentation. We show that this method of bootstrapping improves the performance of VOTING EXPERTS in a variety of unsupervised word segmentation scenarios, and generally impr...